Using a Parameterizable and Domain-Adaptive Information Extraction System for Annotating Large-Scale Corpora
Authors
Abstract
In this paper we describe a parameterizable and domain-adaptive Information Extraction (IE) system for German texts and present some ideas on how this kind of system could effectively support Corpus Linguistics (CL) tasks. We also tentatively address the complementary question of how corpus linguistics can benefit IE, especially in the automatic learning of templates of interest for IE tasks, a topic that is crucial for the further development of highly flexible IE systems. We briefly describe some steps taken to adapt the IE system to a new domain in order to illustrate the points where, in our opinion, IE and CL should go for a...
Similar resources
A Methodology for Semantically Annotating a Corpus Using a Domain Ontology and Machine Learning
In this paper we present a methodology for the semantic annotation of domain-specific corpora. This method relies on a domain ontology used initially for identifying and annotating domain-specific instances within the corpus. A machine learning-based information extraction system is then trained on the annotated corpus. The final result of this process is a model which is used to annotate new co...
Domain-Adaptive Information Extraction
We present in this paper the methodology developed within the PARADIME (Parameterizable Domain-Adaptive Information and Message Extraction) project for designing an Information Extraction (IE) system easily adaptable to new domains of application. For this we went for a strict separation of the (shallow) linguistic processing modules on the one hand and the domain-modeling modules on the other ...
Annotating Corpora from Various Sources in the Humanities Domain
Voula Giouli. Annotating corpora from various sources in the humanities domain: shortcomings and issues. In this paper, we present work aimed at the linguistic annotation of Greek corpora that belong to the humanities domain, the focus being on the methodological principles as well as the implementation framework adopted. This framework builds on an existin...
A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model
Information extraction (IE) is the process of automatically producing a structured representation from unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP), one that has been intensified by the growing volume, heterogeneity, and unstructured form of information. One of the core information extraction tasks is relation extraction wh...
Quick Pad Tagger: an Efficient Graphical User Interface for Building Annotated Corpora with Multiple Annotation Layers
More and more domain-specific applications on the internet make use of Natural Language Processing (NLP) tools (e.g. Information Extraction systems). The output quality of these applications relies on the output quality of the NLP tools used. Often, the quality can be increased by annotating a domain-specific corpus. However, annotating a corpus is a time-consuming and exhausting task. To redu...